Server Monitoring

What is Server Monitoring?

Server monitoring is the continuous process of tracking a server's performance, health, and availability, irrespective of whether the servers are hosted on physical machines, virtual machines, or cloud instances. It involves collecting and analyzing data on various aspects such as CPU usage, memory utilization, disk activity, network performance, and more. The primary goal is to ensure servers operate optimally, preventing downtime and performance bottlenecks. Enterprise-grade monitoring tools can provide real-time alerts and historical data, enabling IT administrators to proactively identify issues, plan capacity upgrades, and troubleshoot problems efficiently.
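The collect-and-check cycle described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library (assuming a Linux/Unix host, since os.getloadavg is not available on Windows); the function names and the 90% disk threshold are illustrative choices, not part of any particular product.

```python
import os
import shutil

def collect_metrics(path="/"):
    """Collect a few basic server health metrics using only the
    standard library. Linux/Unix only: getloadavg is absent on Windows."""
    usage = shutil.disk_usage(path)
    load1, load5, load15 = os.getloadavg()
    return {
        "disk_used_pct": 100.0 * usage.used / usage.total,
        "load_avg_1m": load1,
        "cpu_count": os.cpu_count(),
    }

def check_thresholds(metrics, disk_pct_max=90.0):
    """Return alert messages for any metric over its (illustrative) threshold."""
    alerts = []
    if metrics["disk_used_pct"] > disk_pct_max:
        alerts.append(
            f"Disk usage {metrics['disk_used_pct']:.1f}% exceeds {disk_pct_max}%"
        )
    return alerts
```

A real monitoring agent would sample these metrics on a schedule, persist the history, and forward alerts to a central console rather than returning them locally.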


Why is Server Monitoring important?

Server monitoring has always played an essential role in maintaining the reliability of IT infrastructure, optimizing resource allocation, and ultimately delivering uninterrupted services to users. Servers are the heart of any IT infrastructure. To ensure peak performance of applications, the server hardware must be working well, the servers should be sized correctly to handle the workload, and there should be no resource bottlenecks.


Why is Server Monitoring challenging?

  • Heterogeneous server infrastructure: The variety of server hardware and operating systems in use makes it impractical for a single administrator to manage all of these technologies. Organizations often have multiple administrators and multiple tools for hardware and OS monitoring, with no integration or correlation between them. Consequently, problem diagnosis is often a long and manual process.
  • Dynamic and virtualized infrastructure: The increasing use of server virtualization is also changing the way in which server monitoring needs to be done. In fact, there are more virtual servers being deployed today than physical servers. Conventional performance monitoring tools and processes used for managing physical servers are no longer sufficient for monitoring virtualized servers.
  • Cloud hosted servers: Many organizations now deploy servers as cloud VMs on hyperscale platforms such as Google Cloud, Azure and AWS. Selecting the correct VM instance type and family for a server workload has become essential to balance performance against cost and to plan capacity. Good monitoring tools allow administrators to right-size their purchasing of cloud VMs.
  • Shared responsibilities: The team that ultimately provides the server hardware is often a different team, or a third-party supplier, from the administrators who run and maintain the server software and the applications or services it hosts. Particularly with cloud hosted server infrastructure, administrators need diagnostic tools to demonstrate when another team or third party is responsible for resolving an issue. Cloud infrastructure abstracts away many of the metrics and server events available to on-prem administrators, so this is challenging without monitoring tools designed specifically for that purpose. Monitoring and diagnostic tools must now provide administrators with sufficient data to raise support calls with cloud suppliers and/or other teams.
  • Alert Storms: A server failure or issue typically impacts other infrastructure, OS and application tiers; if a server fails, the effects can ripple across a large stack. For example, a rogue backup application or security scan consuming all of the CPU on a VMware host will starve the hosted VMs of CPU; in turn, the applications hosted in those VMs become unavailable or unusably slow, and end users experience issues and frustration. Organizations typically also monitor end user experience, application performance, availability and so on, meaning that a single server event can trigger a huge number of alerts. These secondary effects can leave helpdesks inundated. Modern monitoring tools provide event correlation, alert suppression and automated root-cause diagnostics to pinpoint server issues and avoid alarm and alert storms.
  • Automation and Autoscaling: In a world where containerization, virtualization and microservices are common, automation and autoscaling mean that virtualized servers are often spun up on demand and scaled back. The result is a dynamic, transient and ephemeral IT landscape, so monitoring needs to be automated too: manual configuration is no longer practical.
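The event-correlation idea behind alert suppression can be illustrated with a small sketch: given a dependency map (each component and the component it runs on), downstream alerts are suppressed whenever something upstream is already alerting. The data model and function name here are hypothetical, purely to show the technique real tools implement at scale.

```python
def suppress_downstream(alerts, depends_on):
    """Keep only alerts whose upstream dependency chain has no
    alert of its own; downstream alerts are treated as secondary
    effects of the upstream root cause."""
    alerting = set(alerts)

    def upstream_alerting(component):
        parent = depends_on.get(component)
        while parent is not None:
            if parent in alerting:
                return True
            parent = depends_on.get(parent)
        return False

    return [a for a in alerts if not upstream_alerting(a)]

# Example: a VMware host alert suppresses the VM and application alerts.
deps = {"vm1": "esx-host", "app1": "vm1"}
print(suppress_downstream(["esx-host", "vm1", "app1"], deps))
# prints ['esx-host']
```

Production-grade correlation engines work over discovered topology rather than a hand-written map, but the root-cause filtering principle is the same.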

Which key server performance metrics should be monitored?

Core basic metrics that should be monitored generally include:

  • CPU utilization
  • Memory utilization
  • Disk space utilization
  • Top processes by CPU/memory/disk
  • Disk activity levels
  • Network traffic
  • Hardware status
  • Status of daemon processes
  • Handles usage
  • Page file usage
  • Server uptime
  • Network interface status
  • Network connectivity
  • Windows service status
  • File share status
  • TCP traffic
  • Syslog errors
  • Event log errors
  • Time sync status
  • Context switches

Note: metrics may vary slightly between servers, especially Linux/Unix vs. Windows servers.
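Several of the listed metrics (server uptime, memory utilization) can be read directly from the Linux /proc filesystem, which is how many agents gather them. A minimal sketch, assuming a Linux host; the helper names are illustrative.

```python
def read_uptime_seconds(path="/proc/uptime"):
    """Server uptime: the first field of /proc/uptime is seconds since boot."""
    with open(path) as f:
        return float(f.read().split()[0])

def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a dict of integer values (mostly kB)."""
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            info[key.strip()] = int(rest.split()[0])
    return info

mem = read_meminfo()
used_pct = 100.0 * (1 - mem["MemAvailable"] / mem["MemTotal"])
print(f"Uptime: {read_uptime_seconds():.0f}s, memory used: {used_pct:.1f}%")
```

Windows servers expose the equivalent data through performance counters and WMI rather than /proc, which is one reason the metric lists differ between platforms.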

More details on key metrics and KPIs to track when monitoring servers are given in: Server Monitoring – KPIs & Metrics | eG Innovations.


Server Uptime vs. Server Availability

Two of the most important metrics leveraged in SLAs and SLOs are Server Uptime and Server Availability. They are occasionally used interchangeably, but they are very different measures. See: What is Server Uptime Monitoring? (eginnovations.com).


Uptime and availability are not the same

Uptime and availability are often used interchangeably, but they are not the same:

  • Uptime is the amount of time a server is up and operational. It is usually an internal measure of the server – i.e., it is reported by the server itself.
  • Availability is the percentage of time, in a specific time interval, during which a server is available for its intended purpose. For example, network availability of a server can be measured by pinging the server.

Availability is usually an external check, unlike uptime, which is an internal check.
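The availability definition above reduces to simple arithmetic: the fraction of external checks in an interval that succeeded. A minimal sketch; the function name and the five-minute check cadence are illustrative.

```python
def availability_pct(check_results):
    """Availability over an interval: the percentage of external
    checks (e.g. pings) that succeeded."""
    if not check_results:
        return 0.0
    return 100.0 * sum(check_results) / len(check_results)

# 288 five-minute ping checks in a day, 3 of which failed:
checks = [True] * 285 + [False] * 3
print(f"{availability_pct(checks):.2f}%")  # prints 98.96%
```

Note that a server can report high uptime internally while its measured availability is lower, e.g. when a network fault makes a running server unreachable.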


What type of workloads do servers typically host?

Business-critical workloads hosted on servers which administrators monitor and optimize often include:

  • File storage servers
  • Application servers
  • Network servers
  • Web servers
  • Database servers

Monitoring tools can be installed onto virtually any server, whether they be on-premises or in the cloud. If you work with a third-party cloud services provider, they will have their own monitoring tools in place, but using your own tools to monitor cloud server performance often provides an added layer of protection against downtime.

Because workloads vary so much based on use case and organizational demands, it is very important to select monitoring tools that integrate with the specific technologies in use. For example, if deploying web servers, you would need to use a product that supports the specific vendor and stack in use, whether that be Microsoft IIS, Apache or Nginx.
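For web servers specifically, the most basic external check is an HTTP probe, which works against IIS, Apache and Nginx alike. A minimal sketch using the Python standard library; the function names and the "2xx/3xx means healthy" rule are illustrative assumptions, not any product's behavior.

```python
from urllib.request import urlopen
from urllib.error import URLError

def classify_status(code):
    """Treat 2xx and 3xx HTTP responses as healthy; anything else as not."""
    return 200 <= code < 400

def check_web_server(url, timeout=5):
    """Probe a web server over HTTP and report whether it answered
    with a healthy status within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return classify_status(resp.status)
    except URLError:
        return False
```

A fuller synthetic check would also time the response and validate the body content, since a server can return 200 while serving an error page.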


What alerts should I configure? How should I set up alerts for specific server conditions or events?

As workloads vary so widely, modern observability and monitoring tools rely on AIOps technologies that leverage machine learning, statistical analysis and algorithms to learn the normal baseline behavior of a customer's individual IT deployments. Products such as eG Enterprise and other AIOps observability solutions set up metric thresholds and alerting automatically, combining learned knowledge with vendor best-practice advice for alerting.

If you are using tools that require manual configuration and tuning for alerting, advice on how to establish stable thresholds and how to decide the "right" value for a threshold is available here: White Paper | Make IT Service Monitoring Simple & Proactive with AIOps Powered Intelligent Thresholding & Alerting.
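The baselining idea behind automatic thresholding can be sketched with basic statistics: learn a metric's recent distribution and alert when it strays several standard deviations from the mean. This is a deliberately simplified illustration; real AIOps engines use richer models (seasonality, trends, multi-metric correlation), and the k=3 factor is an illustrative choice.

```python
import statistics

def baseline_threshold(history, k=3.0):
    """A simple statistical baseline: the alert threshold is the
    mean of the metric's recent history plus k standard deviations."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return mean + k * stdev

# Recent CPU utilization samples (%) for one server:
cpu_history = [22, 25, 21, 24, 23, 26, 22, 24]
limit = baseline_threshold(cpu_history)
print(f"alert above {limit:.1f}% CPU")
```

The appeal over a fixed threshold is that each server gets a limit matched to its own normal behavior, so a database server that idles at 60% CPU is not permanently in an alert state.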


Monitoring server migration to the cloud

Using a monitoring product that supports both your legacy on-prem server landscape and the cloud services you intend to use allows you to baseline your on-prem technologies, right-size cloud deployments, and measure the success of migration projects. Many products support hybrid cloud architectures where some infrastructure remains on-prem. Monitoring tool licensing varies considerably; if you are considering migration projects, it can be advisable to choose tools that allow you to migrate licenses between on-prem and cloud technologies, e.g., from Microsoft SQL Server on-prem to SQL Server hosted on Azure VMs.


Does Server Monitoring have to be agent-based or can it be agentless?

Server monitoring is usually agent-based. Agentless monitoring is also supported by some tools, but there are several caveats. Firstly, agentless monitoring requires remote access to admin accounts on the monitored servers, which can introduce unexpected security risks. Secondly, agentless monitoring often consumes considerably more network bandwidth. Hence, agent-based monitoring is preferred.


Can Server Monitoring be used to track the configuration of the server?

Yes, many Server Monitoring tools such as eG Enterprise monitor both the performance of the server and its configuration. Configuration changes such as OS version changes, new software being installed, hot fixes applied, etc. can be tracked and reported. See: Configuration Management & Change Tracking for Observability (eginnovations.com) for more information.
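Configuration change tracking boils down to taking periodic snapshots of the server's configuration and diffing them. A minimal sketch of that diff step; the dict-based snapshot format and the example version values are hypothetical, purely to show the idea.

```python
def diff_config(old, new):
    """Compare two configuration snapshots (dicts of setting -> value)
    and report what was added, removed, or changed."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: (old[k], new[k]) for k in old.keys() & new.keys()
               if old[k] != new[k]}
    return {"added": added, "removed": removed, "changed": changed}

before = {"os_version": "22.04.3", "nginx": "1.24.0"}
after = {"os_version": "22.04.4", "nginx": "1.24.0", "openssl": "3.0.13"}
print(diff_config(before, after))
```

Correlating such configuration diffs with performance timelines is what lets a tool answer "what changed just before this issue started?".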


Server observability

Legacy monitoring tools have focused on resource metrics, and sometimes logs or traces, to answer the question, "Does the server have an issue?" Modern observability tools move beyond monitoring to include intelligent root-cause diagnostics, correlating metrics, logs, traces and other data to provide administrators with answers to the question, "Why does the server have an issue?"